A Speaker Diarization System for Studying Peer-Led Team Learning Groups
Peer-led team learning (PLTL) is a model for teaching STEM courses where
small student groups meet periodically to collaboratively discuss coursework.
Automatic analysis of PLTL sessions would help education researchers to get
insight into how learning outcomes are impacted by individual participation,
group behavior, team dynamics, etc. Toward this goal, speech and language
technology can help, and speaker diarization lays the foundation for such
analysis. In this study, we establish a new corpus, CRSS-PLTL, which
contains speech data from 5 PLTL teams over a semester (10 sessions per team
with 5-to-8 participants in each team). In CRSS-PLTL, every participant wears a
LENA device (portable audio recorder) that provides multiple audio recordings
of the event. Our proposed solution is unsupervised and combines a new online
speaker change detection algorithm, termed the G3 algorithm, with
Hausdorff-distance-based clustering to provide improved detection accuracy.
Additionally, we exploit cross-channel information to refine our
diarization hypothesis. The proposed system provides good improvements in
diarization error rate (DER) over the baseline LIUM system. We also present
higher level analysis such as the number of conversational turns taken in a
session, and speaking-time duration (participation) for each speaker.

Comment: 5 pages, 2 figures, 2 tables; Proceedings of INTERSPEECH 2016, San
Francisco, US
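The diarization error rate (DER) used to compare systems above can be illustrated with a minimal frame-based scorer. This is only a sketch: the segment format and function name are assumptions, the standard tool (NIST md-eval) additionally applies a forgiveness collar and an optimal mapping between hypothesis and reference speaker labels, and overlapped speech is ignored here.

```python
def der(reference, hypothesis, step=0.01):
    """Frame-based DER over (start_sec, end_sec, speaker) segment lists.

    Counts missed speech, false alarms, and speaker confusion per frame,
    normalized by total reference speech time. Simplification: assumes
    hypothesis speaker labels are already aligned to reference labels
    (real scoring finds an optimal one-to-one mapping first).
    """
    end = max(seg[1] for seg in reference + hypothesis)
    n = int(round(end / step))

    def labels(segments):
        # One speaker label (or None for non-speech) per frame.
        lab = [None] * n
        for start, stop, spk in segments:
            for i in range(int(round(start / step)), int(round(stop / step))):
                lab[i] = spk
        return lab

    ref, hyp = labels(reference), labels(hypothesis)
    miss = fa = conf = speech = 0
    for r, h in zip(ref, hyp):
        if r is not None:
            speech += 1
            if h is None:
                miss += 1      # speech frame missed entirely
            elif h != r:
                conf += 1      # speech detected but wrong speaker
        elif h is not None:
            fa += 1            # speech hypothesized in a non-speech frame
    return (miss + fa + conf) / speech
```

For example, a hypothesis that attributes the second of two equal-length turns to the wrong speaker scores a DER of 0.5 under this scorer.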
FEARLESS STEPS Challenge (FS-2): Supervised Learning with Massive Naturalistic Apollo Data
The Fearless Steps Initiative by UTDallas-CRSS led to the digitization,
recovery, and diarization of 19,000 hours of original analog audio data, as
well as the development of algorithms to extract meaningful information from
this multi-channel naturalistic data resource. The 2020 FEARLESS STEPS (FS-2)
Challenge is the second annual challenge held for the Speech and Language
Technology community to motivate supervised learning algorithm development for
multi-party and multi-stream naturalistic audio. In this paper, we present an
overview of the challenge sub-tasks, data, performance metrics, and lessons
learned from Phase-2 of the Fearless Steps Challenge (FS-2). We present
advancements made in FS-2 through extensive community outreach and feedback. We
describe innovations in the challenge corpus development, and present revised
baseline results. We finally discuss the challenge outcome and general trends
in system development across both phases (Phase FS-1 Unsupervised, and Phase
FS-2 Supervised) of the challenge, and its continuation into multi-channel
challenge tasks for the upcoming Fearless Steps Challenge Phase-3.

Comment: Paper accepted at the INTERSPEECH 2020 Conference
Novel statistical voice activity detectors
In this thesis, we propose several practical statistical voice activity detectors (VADs) that combine the voice activity information in the short-term and long-term statistics of the speech signal. Unlike most VADs, which assume that the cues to activity lie within the current frame alone, the proposed schemes seek evidence of activity in the current as well as the neighboring frames. In particular, we develop primary and contextual detectors to process the short-term and long-term information, respectively. We use the perceptual Ephraim-Malah (PEM) model to develop three primary detectors based on the Bayesian, Neyman-Pearson (NP), and competitive NP (CNP) approaches. Moreover, by viewing voice activity detection as a composite hypothesis test in which the prior signal-to-noise ratio (SNR) is the free parameter, we show that the prior SNR and the hypothesis are correlated, i.e., a high prior SNR is more likely to be associated with the 'speech' hypothesis than the 'pause' hypothesis, and vice versa. Unlike the Bayesian and NP approaches, the CNP approach alone exploits this correlation.
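The primary-plus-contextual structure described above can be sketched as follows. This is a toy illustration only: a log-energy threshold stands in for the PEM-based likelihood-ratio primary detectors, and a majority vote over neighboring frames stands in for the long-term contextual stage; the function name, threshold, and window size are all assumptions.

```python
import numpy as np

def vad(frame_log_energy, threshold, context=5):
    """Two-stage VAD sketch.

    Stage 1 (primary): per-frame decision from short-term statistics,
    here a simple log-energy threshold.
    Stage 2 (contextual): a frame is finally marked active only if a
    majority of frames in a +/- `context` window were marked active by
    the primary detector, so isolated spikes are suppressed.
    """
    primary = frame_log_energy > threshold
    out = np.zeros_like(primary)
    n = len(primary)
    for i in range(n):
        lo, hi = max(0, i - context), min(n, i + context + 1)
        out[i] = primary[lo:hi].mean() > 0.5
    return out
```

The contextual stage is what distinguishes this structure from a purely frame-local detector: a single noisy frame cannot flip the decision on its own.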
Automatic language analysis and identification based on speech production knowledge
In this paper, a language analysis and classification system that leverages knowledge of speech production is proposed. The proposed scheme automatically extracts key production traits (or "hot-spots") that are strongly tied to the underlying language structure. Particularly, the speech utterance is first parsed into consonant and vowel clusters. Subsequently, the production traits for each cluster are represented by the corresponding temporal evolution of speech articulatory states. It is hypothesized that a selection of these production traits is strongly tied to the underlying language and can be exploited for language ID. The new scheme is evaluated on our South Indian Languages (SInL) corpus, which consists of 5 closely related languages spoken in India, namely Kannada, Tamil, Telugu, Malayalam, and Marathi. Good accuracy is achieved, with a rate of 65% obtained in a difficult 5-way classification task with about 4 seconds of train and test speech data per utterance. Furthermore, the proposed scheme is also able to automatically identify key production traits of each language (e.g., dominant vowels, stop consonants, fricatives).
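The first stage described above, parsing an utterance into consonant and vowel clusters, can be sketched on a phone sequence. The five-vowel inventory and string-based segmentation here are simplifying assumptions for illustration; the paper derives its clusters from articulatory state sequences rather than phone labels.

```python
VOWELS = {"a", "e", "i", "o", "u"}  # illustrative inventory, not the paper's

def cv_clusters(phones):
    """Group a phone sequence into maximal runs of consonants (C) or
    vowels (V), the cluster units whose temporal evolution is then
    characterized downstream."""
    clusters = []
    for p in phones:
        kind = "V" if p in VOWELS else "C"
        if clusters and clusters[-1][0] == kind:
            clusters[-1][1].append(p)   # extend the current run
        else:
            clusters.append((kind, [p]))  # start a new cluster
    return [(kind, tuple(ps)) for kind, ps in clusters]
```

For instance, the phone string "kannada" segments into alternating C and V clusters, with the geminate "nn" kept together as one consonant cluster.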